feat: Use PartitionValueExtractor interface in Spark reader path #17850
Conversation
Force-pushed eb515ef to d5a979e
Wouldn't this break existing users who might have their own implementations of PartitionValueExtractor? We should avoid doing this. Instead, can we introduce a new interface named SparkPartitionValueExtractor in some other Spark package? If need be, this can extend the existing PartitionValueExtractor as well.
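For reference, a minimal sketch of the kind of interface being proposed here (the name SparkPartitionValueExtractor and its package are the reviewer's suggestion; the parent interface is the existing sync-path one):

```scala
package org.apache.hudi.spark

import org.apache.hudi.sync.common.model.PartitionValueExtractor

// Hypothetical Spark-facing extractor type, per the suggestion above.
// Extending the existing sync-path interface keeps current user
// implementations source-compatible while giving the reader path
// its own type to bind against.
trait SparkPartitionValueExtractor extends PartitionValueExtractor
```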
Array.fill(partitionColumns.length)(UTF8String.fromString(partitionPath))
} else if(usePartitionValueExtractorOnRead && !StringUtils.isNullOrEmpty(partitionValueExtractorClass)) {
  try {
    val partitionValueExtractor = Class.forName(partitionValueExtractorClass)
Can we move this to a private method to keep this method lean?
Maybe parsePartitionValuesBasedOnPartitionValueExtractor.
Yes, done.
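For illustration, the extracted helper might look roughly like this (a sketch with assumed names, based on the snippet above; not the committed code):

```scala
import scala.collection.JavaConverters._

import org.apache.hudi.sync.common.model.PartitionValueExtractor
import org.apache.spark.unsafe.types.UTF8String

object PartitionValueParsing {
  // Hypothetical private helper hoisted out of the read path: instantiate
  // the configured extractor reflectively and convert the extracted
  // partition values to Spark's UTF8String representation.
  private[hudi] def parsePartitionValuesBasedOnPartitionValueExtractor(
      partitionValueExtractorClass: String,
      partitionPath: String): Array[UTF8String] = {
    val extractor = Class.forName(partitionValueExtractorClass)
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[PartitionValueExtractor]
    extractor.extractPartitionValuesInPath(partitionPath)
      .asScala
      .map(UTF8String.fromString)
      .toArray
  }
}
```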
.withDocumentation("Key Generator type to determine key generator class");

public static final ConfigProperty<String> PARTITION_VALUE_EXTRACTOR_CLASS = ConfigProperty
    .key("hoodie.datasource.hive_sync.partition_extractor_class")
None of the table properties have "hoodie.datasource." as a prefix.
We should define two configs:
hoodie.datasource.hive_sync.partition_extractor_class for the writer property,
and
hoodie.table.hive_sync.partition_extractor_class for the table config.
Users should not be able to set the table property directly; they should always set only the writer property, i.e. hoodie.datasource.hive_sync.partition_extractor_class.
It seems like we are storing configs with the hoodie.datasource prefix, such as the following, so maybe we need to change those as well:
hoodie.datasource.write.drop.partition.columns
hoodie.datasource.write.hive_style_partitioning
Anyway, I have created two different properties: one in HoodieTableConfig and the other in HoodieSyncConfig. Also added validation to make sure people don't give different values for them.
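A rough sketch of the two-property shape and the value-conflict check being described (the key strings come from the review; names, placement, and the validation helper are assumptions):

```scala
import org.apache.hudi.common.config.ConfigProperty

object PartitionExtractorConfigs {
  // Writer-facing property: the only one users should set directly.
  val WRITER_PARTITION_EXTRACTOR_CLASS: ConfigProperty[String] = ConfigProperty
    .key("hoodie.datasource.hive_sync.partition_extractor_class")
    .noDefaultValue()
    .withDocumentation("Writer property for the partition value extractor class.")

  // Table config: populated from the writer property, never set directly.
  val TABLE_PARTITION_EXTRACTOR_CLASS: ConfigProperty[String] = ConfigProperty
    .key("hoodie.table.hive_sync.partition_extractor_class")
    .noDefaultValue()
    .withDocumentation("Table config mirroring the writer property.")

  // Validation along the lines mentioned above: if both are present,
  // they must agree.
  def validate(writerValue: Option[String], tableValue: Option[String]): Unit =
    for (w <- writerValue; t <- tableValue) {
      require(w == t, s"Conflicting partition extractor classes: writer=$w, table=$t")
    }
}
```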
.or(() -> Option.ofNullable(cfg.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME)));

if (!partitionFieldsOpt.isPresent()) {
  return Option.empty();
Is this not NonPartitionedExtractor?
Yes, made the change.
*/

package org.apache.hudi.sync.common.model;
package org.apache.hudi.hive.sync;
Let's not change this.
Yeah, my bad. I am using the same package now, but I need to move the interface to hudi-common as there is a compile-time dependency now.
val USE_PARTITION_VALUE_EXTRACTOR_ON_READ: ConfigProperty[String] = ConfigProperty
  .key("hoodie.datasource.read.partition.value.using.partion-value-extractor-class")
  .defaultValue("true")
Can we disable this by default?
Since we have an infer function, this might get exercised out of the box. Let's keep OOB behavior untouched.
Sure, makes sense.
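That resolution would amount to something like this (a sketch; the key string is copied verbatim from the diff above, including its current spelling):

```scala
import org.apache.hudi.common.config.ConfigProperty

object ReadPathFlags {
  // Same flag as in the diff, with the default flipped to "false" so
  // out-of-the-box read behavior stays untouched.
  val USE_PARTITION_VALUE_EXTRACTOR_ON_READ: ConfigProperty[String] = ConfigProperty
    .key("hoodie.datasource.read.partition.value.using.partion-value-extractor-class")
    .defaultValue("false")
    .withDocumentation("Whether the Spark read path should use the configured " +
      "PartitionValueExtractor to parse partition values; disabled by default.")
}
```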
import java.util

class TestCustomSlashPartitionValueExtractor extends PartitionValueExtractor {
same here
Made the changes.
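For context, a sketch of what a slash-date test extractor like the one in this diff might do, turning a yyyy/mm/dd path into a single yyyy-mm-dd value (illustrative; the committed test may differ):

```scala
import java.util

import org.apache.hudi.sync.common.model.PartitionValueExtractor

// Hypothetical version of the test extractor: "2024/01/03" -> ["2024-01-03"].
class TestCustomSlashPartitionValueExtractor extends PartitionValueExtractor {
  override def extractPartitionValuesInPath(partitionPath: String): util.List[String] = {
    val parts = partitionPath.split("/")
    require(parts.length == 3, s"Expected a yyyy/mm/dd path, got: $partitionPath")
    util.Collections.singletonList(parts.mkString("-"))
  }
}
```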
.markAdvanced()
.withDocumentation("Field in the table to use for determining hive partition columns.");

public static final ConfigProperty<String> META_SYNC_PARTITION_EXTRACTOR_CLASS = ConfigProperty
Let's leave this as is, without any changes.
Sure, reverting this change.
Seq(7, "a7", 7000, "2024-01-03", "CAN", "ON", "TOR")
)

// Test partition pruning with combined date and state filter
The proper way to assert here: we should corrupt one of the parquet files in another partition which does not match the predicate. Then the query will only succeed if partition pruning really worked; if not, the query will hit a FileNotFoundException.
But we can't afford to do this for every query, since one of the data files will be corrupted. So maybe we can do it for one or two of the partition pruning queries you have here.
Added the test case with corrupted parquet file.
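A sketch of the corruption-based assertion described above: clobber a parquet file in a partition the predicate should prune, then run the filtered query; if pruning works, the corrupt file is never opened, otherwise the scan fails on the garbage bytes (spark, basePath, corruptFilePath, and the predicate are placeholders):

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

import org.apache.spark.sql.SparkSession

// Overwrite a data file in a non-matching partition, then verify the
// pruned query still succeeds.
def assertPruningViaCorruption(spark: SparkSession, basePath: String,
                               corruptFilePath: String): Unit = {
  Files.write(Paths.get(corruptFilePath),
    "not-a-parquet-file".getBytes(StandardCharsets.UTF_8))
  val rows = spark.read.format("hudi").load(basePath)
    .where("date = '2024-01-03' and state = 'ON'") // must not touch the corrupted partition
    .collect()
  assert(rows.nonEmpty)
}
```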
public static final ConfigProperty<String> PARTITION_VALUE_EXTRACTOR_CLASS = ConfigProperty
    .key("hoodie.datasource.hive_sync.partition_extractor_class")
    .defaultValue("org.apache.hudi.hive.MultiPartKeysValueExtractor")
    .withInferFunction(cfg -> {
We should not be setting any default here, right?
OK to have the infer function.
Sure.
We can't use a config key with "hive_sync" in the name; we plan to use this for reading.
Maybe hoodie.table.partition_value_extractor_class.
But we might have to introduce new partition value extractor classes for the read path instead of using the same one we use for hive sync.
Let me think about it more or chat with others to see how we can go about this.
metadataTable.close()
}
}
I understand hive-style partitioning and the custom partition value extractor are mutually exclusive,
but can we add a test with a custom partition value extractor and URL encoding enabled?
Added
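One way such a test's extractor could look, combining URL-encoded partition paths with custom extraction (illustrative only; the class name and decoding choice are assumptions, not the committed test):

```scala
import java.net.URLDecoder
import java.util

import org.apache.hudi.sync.common.model.PartitionValueExtractor

// Decode each URL-encoded path segment into a partition value,
// e.g. "2024-01-03%2010%3A00" -> "2024-01-03 10:00".
class TestUrlDecodingPartitionValueExtractor extends PartitionValueExtractor {
  override def extractPartitionValuesInPath(partitionPath: String): util.List[String] = {
    val values = new util.ArrayList[String]()
    partitionPath.split("/").foreach(seg => values.add(URLDecoder.decode(seg, "UTF-8")))
    values
  }
}
```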
Force-pushed d1d62f3 to 908fd75
Create unit test
Fix the test custom partition value extractor interface
Force-pushed febf1da to de1f911
nsivabalan left a comment
Hey @yihua: can you take a sneak peek at this PR?
Let's chat about how we can go about it.
}

// Validate partition value extractor
val currentPartitionValueExtractor = params.getOrElse(HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.key(), null)
We should make a new writer property.
Let's align before we go ahead with more changes.
if (!partitionFieldsOpt.isPresent()) {
  return Option.empty();
  return Option.of("org.apache.hudi.hive.NonPartitionedExtractor");
Can we move this into a separate PR?
It looks like a bug fix, right?
&& cfg.getString(KeyGeneratorOptions.HIVE_STYLE_PARTITIONING_ENABLE.key()).equals("true")) {
  return Option.of("org.apache.hudi.hive.HiveStylePartitionValueExtractor");
} else if (cfg.contains(SLASH_SEPARATED_DATE_PARTITIONING)
    && cfg.getString(SLASH_SEPARATED_DATE_PARTITIONING).equals("true")) {
Why change these?
.markAdvanced()
.withDocumentation("Class which implements PartitionValueExtractor to extract the partition values, "
    + "default 'org.apache.hudi.hive.MultiPartKeysValueExtractor'.");
    + "default is inferred based on partition configuration.");
Let's chat about how the new config interplays with the existing hive sync config.
Describe the issue this Pull Request addresses

This PR enables the use of custom PartitionValueExtractor implementations when reading Hudi tables in Spark, allowing users to define custom logic for extracting partition values from partition paths. Previously, the PartitionValueExtractor interface was only used during write/sync operations, not during read operations.

Summary and Changelog

Users can now configure custom partition value extractors for read operations using the hoodie.datasource.read.partition.value.extractor.class option, enabling support for non-standard partition path formats.

Changes:
- Moved the PartitionValueExtractor interface from hudi-sync-common to hudi-common for broader accessibility
- Added the PARTITION_VALUE_EXTRACTOR_CLASS config to HoodieTableConfig and DataSourceOptions
- Updated HoodieSparkUtils.createPartitionSchema() to use PartitionValueExtractor for partition value extraction
- Updated HoodieFileIndex and related classes to support custom partition value extractors
- Added TestCustomSlashPartitionValueExtractor demonstrating date formatting from slash-separated paths (yyyy/mm/dd → yyyy-mm-dd)
- Updated existing PartitionValueExtractor implementations to use the relocated interface

Impact
Public API Changes:
- New config hoodie.datasource.read.partition.value.extractor.class for Spark reads
- PartitionValueExtractor interface relocated from org.apache.hudi.sync.common.model to org.apache.hudi.hive.sync

User-Facing Changes:
- Users can now customize partition value extraction during read operations by providing a custom PartitionValueExtractor implementation
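For illustration, a read using the new option might look like this (spark, basePath, the fully qualified extractor class, and the filter column are placeholders; only the option key comes from this description):

```scala
// Hypothetical usage of the new read option with the PR's test extractor.
val df = spark.read.format("hudi")
  .option("hoodie.datasource.read.partition.value.extractor.class",
    "org.apache.hudi.TestCustomSlashPartitionValueExtractor")
  .load(basePath)
df.where("partition_date = '2024-01-03'").show()
```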
Risk Level

Low - This is an additive change that maintains backward compatibility. When no custom extractor is specified, the default behavior remains unchanged (standard slash-based splitting). The change has been tested with custom partition value extractor implementations.
Documentation Update

Documentation should be updated to include:
- The new config hoodie.datasource.read.partition.value.extractor.class in the configuration reference (WIP)
- Guidance on custom PartitionValueExtractor implementations for read operations (WIP)

Contributor's checklist